| time | topic |
|---|---|
| 1:00 | Data visualisation principles |
| 1:30 | Identifying poor elements in a plot |
| 2:00 | Fixing a plot design |
| 2:30 | BREAK |
🧩 Feel free to ask questions any time. 🤔
🎯 The objectives for today are:
| time | topic |
|---|---|
| 1:00 | Data visualisation principles |
| 1:30 | Identifying poor elements in a plot |
| 2:00 | Fixing a plot design |
| 2:30 | BREAK |
Cleveland and McGill (1984)
Illustrations made by Emi Tanaka
Based on the accuracy with which readers returned the numerical values.
Primary mapping used in common plots
Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.
# load libraries
library(tidyverse)
library(colorspace)
library(patchwork)
library(broom)
library(palmerpenguins)
library(ggbeeswarm)
library(vcd)
library(nullabor)
library(MASS)
library(colorspace)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(dplyr::rename)
conflicts_prefer(dplyr::mutate)
conflicts_prefer(dplyr::summarise)
# prepare data
tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
filter(country == "Australia", year > 1996, year < 2013) |>
select(year, contains("new_sp"))
tb_tidy <- tb |>
select(-new_sp, -new_sp_m04, -new_sp_m514,
-new_sp_f04, -new_sp_f514) |>
pivot_longer(starts_with("new_sp"),
names_to = "sexage",
values_to = "count") |>
mutate(sexage = str_remove(sexage, "new_sp_")) |>
separate_wider_position(
sexage,
widths = c(sex = 1, age = 4),
too_few = "align_start"
) |>
filter(age != "u") |>
mutate(age = fct_recode(age, "0-14" = "014",
"15-24" = "1524",
"15-24" = "1524",
"25-34" = "2534",
"35-44" = "3544",
"45-54" = "4554",
"55-64" = "5564",
"> 65" = "65"))
# plot
tb_tidy |>
filter(!(age %in% c("0-14", "unknown"))) |>
ggplot(aes(x=year,
y=count,
colour=sex)) +
geom_point() +
geom_smooth(se=F) +
facet_wrap(~age, ncol = 3) +
scale_color_discrete_divergingx(palette="Geyser") +
scale_x_continuous("year",
breaks = seq(1998, 2012, 2),
labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
theme(axis.text = element_text(size=10)) +
ggtitle("Arrangement A")tb_tidy |>
filter(!(age %in% c("0-14", "unknown"))) |>
ggplot(aes(x = year,
y = count,
colour = age)) +
geom_point() +
geom_smooth(se=F) +
facet_wrap(~sex, ncol = 2) +
scale_color_discrete_divergingx(palette="Zissou 1") +
scale_x_continuous("year",
breaks = seq(1998, 2012, 2),
labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
theme(axis.text = element_text(size=10)) +
ggtitle("Arrangement B")Making comparisons across plots requires the eye to jump from one focal point to another. It may result in not noticing differences.
tb_tidy |>
filter(age %in% c("45-54", "55-64"),
sex == "f") |>
ggplot(mapping=aes(x=year,
y=count)) +
geom_point() +
geom_smooth(aes(colour=age), se=F, method="lm") +
facet_wrap(~age, ncol = 2) +
scale_color_discrete_divergingx(palette="Geyser") +
scale_x_continuous("year",
breaks = seq(1998, 2012, 4),
labels = c("98", "02", "06", "10")) +
theme(legend.position="none",
axis.text = element_text(size=10))tb_tidy |>
filter(age %in% c("45-54", "55-64"),
sex == "f") |>
ggplot(mapping=aes(x=year,
y=count)) +
geom_smooth(aes(colour=age), se=F, method="lm") +
scale_color_discrete_divergingx(palette="Geyser") +
scale_x_continuous("year",
breaks = seq(1998, 2012, 4),
labels = c("98", "02", "06", "10")) +
theme(legend.position="none",
axis.text = element_text(size=10))Help the reader remember what the pattern is in other panels by under-plotting all.
Too many colours, too busy
Can you find the odd one out?
There are three basic choices of palettes:
Which one you choose depends on the
Resources for exploring color:
V1 = tibble(x = 1:7,
native = factor(c("quoll", "emu", "roo",
"bilby", "quokka", "dingo", "numbat")))
ggplot(V1, aes(x=x, y=1, fill=native)) +
geom_tile() +
geom_text(aes(x=x, y=1, label=native)) +
ggtitle("qualitative") +
theme_minimal() +
theme(legend.position = "none",
panel.background =
element_rect(fill = 'transparent', colour = NA),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
#axis.line = element_line(colour = "white"),
panel.grid.major = element_line(colour = "white"),
panel.grid.minor = element_line(colour = "white"))V2 = tibble(x = 1:7,
fill = 1:7)
ggplot(V2, aes(x=x, y=1, fill=fill)) +
geom_tile() +
geom_text(aes(x=x, y=1, label=fill)) +
ggtitle("sequential: emphasise high") +
scale_fill_continuous_sequential(palette = "PinkYl") +
theme_minimal() +
theme(legend.position = "none",
panel.background =
element_rect(fill = 'transparent', colour = NA),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
#axis.line = element_line(colour = "white"),
panel.grid.major = element_line(colour = "white"),
panel.grid.minor = element_line(colour = "white"))V3 = tibble(x = 1:7,
fill = -3:3)
ggplot(V3, aes(x=x, y=1, fill=fill)) +
geom_tile() +
geom_text(aes(x=x, y=1, label=fill)) +
ggtitle("diverging: emphasise high and low") +
scale_fill_continuous_divergingx(palette = "ArmyRose") +
theme_minimal() +
theme(legend.position = "none",
panel.background =
element_rect(fill = 'transparent', colour = NA),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
#axis.line = element_line(colour = "white"),
panel.grid.major = element_line(colour = "white"),
panel.grid.minor = element_line(colour = "white"))❌ Jet rainbow palette
library(vital)
library(viridis)
am <- aus_mortality |>
filter(State == "Victoria",
Sex != "total",
Year < 1980,
Age < 90)
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) +
geom_line() +
facet_wrap(~Sex, ncol=1) +
scale_color_gradientn(colours = rainbow(10)) +
scale_y_log10() +
theme(aspect.ratio = 0.5)Produces false detail, banding and color blindness ambiguity.
✅ viridis palettes
Have a uniform scale, match grey scale ladder. The turbo palette alleviates Jet rainbow palette problems.
❌ Jet rainbow palette
If the variable mapped to colour has a right-skewed distribution, consider transforming it using a log or a square root.
This is the same data, where count has been transformed using square root.
# remotes::install_github("kevinwang09/learningtower")
library(learningtower)
student_data_2018 <- load_student(2018)
student_means <- student_data_2018 |>
group_by(country) |>
summarise(math = mean(math, na.rm=TRUE),
read = mean(read, na.rm=TRUE),
science = mean(science, na.rm=TRUE))
save(student_data_2018,
file="data/student_data_2018.rda")
save(student_means, file="data/student_means.rda")
# Compute differences and bootstrap
student2018_stats <- student_data_2018 %>%
group_by(country) %>%
summarise(mathgap=mean(math[gender=="male"],
na.rm=TRUE)-
mean(math[gender=="female"],
na.rm=TRUE),
wmathgap=weighted.mean(
math[gender=="male"],
w=stu_wgt[gender=="male"],
na.rm=T)-
weighted.mean(
math[gender=="female"],
w=stu_wgt[gender=="female"],
na.rm=T),
readgap=mean(read[gender=="male"],
na.rm=TRUE)-
mean(read[gender=="female"],
na.rm=TRUE),
wreadgap=weighted.mean(
read[gender=="male"],
w=stu_wgt[gender=="male"],
na.rm=T)-
weighted.mean(
read[gender=="female"],
w=stu_wgt[gender=="female"],
na.rm=T))
save(student2018_stats, file="data/student2018_stats.rda")
library(boot)
cimathfn <- function(d, i) {
x <- d[i,]
if (nrow(x) == 0) {
ci <- 0
}
else {
ci <- weighted.mean(x$math[x$gender=="male"],
w=x$stu_wgt[x$gender=="male"], na.rm=T)-
weighted.mean(x$math[x$gender=="female"],
w=x$stu_wgt[x$gender=="female"], na.rm=T)
}
ci
}
cireadfn <- function(d, i) {
x <- d[i,]
if (nrow(x) == 0) {
ci <- 0
}
else {
ci <- weighted.mean(x$read[x$gender=="male"],
w=x$stu_wgt[x$gender=="male"], na.rm=T)-
weighted.mean(x$read[x$gender=="female"],
w=x$stu_wgt[x$gender=="female"], na.rm=T)
}
ci
}
bootmathfn <- function(d) {
if (nrow(d) == 0) {
ci <- c(0, 0)
}
else {
r <- boot(d, statistic=cimathfn, R=200)
l <- sort(r$t)[5]
u <- sort(r$t)[195]
ci <- c(l, u)
}
return(ci)
}
bootreadfn <- function(d) {
if (nrow(d) == 0) {
ci <- c(0, 0)
}
else {
r <- boot(d, statistic=cireadfn, R=200)
l <- sort(r$t)[5]
u <- sort(r$t)[195]
ci <- c(l, u)
}
return(ci)
}
math_results <- student_data_2018 %>%
split(.$country) %>%
purrr::map(bootmathfn)
cnt <- names(math_results)
math_results_tb <- tibble(country = rep(cnt, rep(2, length(cnt))),
ci = rep(c("l", "u"), length(cnt)),
value=unlist(math_results))
math_results_tb <- math_results_tb |>
pivot_wider(names_from = ci, values_from = value) |>
filter(!(l == 0 & u == 0))
read_results <- student_data_2018 %>%
split(.$country) %>%
purrr::map(bootreadfn)
cnt <- names(read_results)
read_results_tb <- tibble(country = rep(cnt, rep(2, length(cnt))),
ci = rep(c("l", "u"), length(cnt)),
value=unlist(read_results))
read_results_tb <- read_results_tb |>
pivot_wider(names_from = ci, values_from = value) |>
filter(!(l == 0 & u == 0))
save(math_results_tb,
file="data/math_results_tb.rda")
save(read_results_tb,
file="data/read_results_tb.rda")load("data/student_means.rda")
student_means_sub <- student_means |>
filter(country %in% c("SGP", "KOR", "POL", "DEU", "NOR", "IRL", "GBR", "IDN", "AUS", "NZL", "USA", "TUR", "PHL", "MAR", "URY", "CHL", "COL", "CAN"))
ggplot(student_means_sub, aes(x=country, y=math)) +
geom_point(colour="#8ACE00", size=4) +
coord_flip() +
xlab("") +
theme(aspect.ratio = 2)Read more about OECD PISA
data(anorexia, package="MASS")
ggplot(data=anorexia,
aes(x=Prewt,
y=Postwt,
colour=Treat)) +
coord_equal() +
xlim(c(70, 110)) +
ylim(c(70, 110)) +
xlab("Pre-treatment weight (lbs)") +
ylab("Post-treatment weight (lbs)") +
geom_abline(intercept=0, slope=1,
colour="grey80", linewidth=1.25) +
geom_density2d() +
geom_point(size=3) +
facet_grid(.~Treat) +
theme(legend.position = "none")ggplot(data=anorexia,
aes(x=Prewt, colour=Treat,
y=(Postwt-Prewt)/Prewt*100)) +
xlab("Pre-treatment weight (lbs)") +
ylab("Percent increase in weight") +
geom_hline(yintercept=0, linewidth=1.25,
colour="grey80") +
geom_point(size=3) +
facet_grid(.~Treat) +
theme(aspect.ratio=1, legend.position = "none")❌ Wrong aspect ratio
ggplot(data=anorexia,
aes(x=Prewt, y=Postwt,
colour=Treat)) +
xlim(c(70, 110)) + ylim(c(70, 110)) +
xlab("Pre-treatment weight (lbs)") +
ylab("Post-treatment weight (lbs)") +
geom_abline(intercept=0, slope=1,
colour="grey80", linewidth=1.25) +
geom_density2d() +
geom_point(size=3) +
facet_wrap(~Treat, ncol=1) +
theme(legend.position = "none",
aspect.ratio = 0.5) #exaggerated
The default aspect ratio in most plots is rectangular.
If you want to compare two quantities, including assessing correlation, the aspect ratio should be square.
Two ways to achieve this with ggplot2:
theme(aspect.ratio=1) PREFERREDcoord_equal()tb_tidy |>
filter(!(age %in% c("0-14", "unknown"))) |>
ggplot(aes(x=year,
y=count,
colour=sex)) +
geom_point() +
geom_smooth(se=F) +
facet_wrap(~age, ncol = 1) +
scale_color_discrete_divergingx(palette="Zissou 1") +
scale_x_continuous("year",
breaks = seq(1998, 2012, 2),
labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
theme(axis.text = element_text(size=10)) +
ggtitle("Wrong aspect ratio")Lines should be on average 45\(^o\).
A choropleth map is used to show a measured variable associated with a political or geographic region. Polygons for the region are filled with colour.
The purpose is to examine the spatial distribution of a variable.
sa2 <- strayr::read_absmap("sa22011") |>
filter(!st_is_empty(geometry)) |>
filter(!state_name_2011 == "Other Territories") |>
filter(!sa2_name_2011 == "Lord Howe Island")
sa2 <- sa2 |> rmapshaper::ms_simplify(keep = 0.5, keep_shapes = TRUE) # Simplify the map!!!
SIR <- read_csv("data/SIR Downloadable Data.csv") |>
filter(SA2_name %in% sa2$sa2_name_2011) |>
dplyr::select(Cancer_name, SA2_name, Sex_name, p50) |>
filter(Cancer_name == "Thyroid", Sex_name == "Females")
ERP <- read_csv("data/ERP.csv") |>
filter(REGIONTYPE == "SA2", Time == 2011, Region %in% SIR$SA2_name) |>
dplyr::select(Region, Value)
# Alternative maps
# Join with sa2 sf object
sa2thyroid_ERP <- SIR |>
left_join(sa2, ., by = c("sa2_name_2011" = "SA2_name")) |>
left_join(., ERP |>
dplyr::select(Region,
Population = Value), by = c("sa2_name_2011"= "Region")) |>
filter(!st_is_empty(geometry))
sa2thyroid_ERP <- sa2thyroid_ERP |>
#filter(!is.na(Population)) |>
filter(!sa2_name_2011 == "Lord Howe Island") |>
mutate(SIR = map_chr(p50, aus_colours)) |>
st_as_sf()
save(sa2, file="data/sa2.rda")
save(sa2thyroid_ERP, file="data/sa2thyroid_ERP.rda")The problem is that high density population areas may be very small geographically. They can disappear in a choropleth map, which means that we get a biased sense of the spatial distribution of a variable.
A cartogram transforms the geographic shape to match the value of a statistic or the population. Its a useful exploratory technique for examining the spatial distribution of a measured variable.
BUT they don’t work for Australia.
# transform to NAD83 / UTM zone 16N
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet=TRUE)
nc <- nc |>
mutate(lBIR79 = log(BIR79))
nc_utm <- st_transform(nc, 26916)
orig <- ggplot(nc) +
geom_sf(aes(fill = lBIR79)) +
ggtitle("choropleth") +
theme_map() +
theme(legend.position = "none")
nc_utm_carto <- cartogram_cont(nc_utm, weight = "BIR74", itermax = 5)
carto <- ggplot(nc_utm_carto) +
geom_sf(aes(fill = lBIR79)) +
ggtitle("cartogram") +
theme_map() +
theme(legend.position = "none")
nc_utm_dorl <- cartogram_dorling(nc_utm, weight = "BIR74")
dorl <- ggplot(nc_utm_dorl) +
geom_sf(aes(fill = lBIR79)) +
ggtitle("dorling") +
theme_map() +
theme(legend.position = "none")
orig + carto + dorl + plot_layout(ncol=1)A hexagon tile map represents every spatial polygon with an equal sized hexagon. In dense areas these will be tesselated, but separated hexagons are placed at centroids of the remote spatial regions.
Now the higher thyroid incidence in Perth suburbs, some Melbourne suburbs, and Sydney are more visible.
if (!file.exists("data/aus_hexmap.rda")) {
## Create centroids set
centroids <- sa2 |>
create_centroids(., "sa2_name_2011")
## Create hexagon grid
grid <- create_grid(centroids = centroids,
hex_size = 0.2,
buffer_dist = 5)
## Allocate polygon centroids to hexagon grid points
aus_hexmap <- allocate(
centroids = centroids,
hex_grid = grid,
sf_id = "sa2_name_2011",
## same column used in create_centroids
hex_size = 0.2,
## same size used in create_grid
hex_filter = 10,
focal_points = capital_cities,
width = 35,
verbose = FALSE
)
save(aus_hexmap,
file = "data/aus_hexmap.rda")
}
load("data/aus_hexmap.rda")
## Prepare to plot
fort_hex <- fortify_hexagon(data = aus_hexmap,
sf_id = "sa2_name_2011",
hex_size = 0.2) |>
left_join(sa2thyroid_ERP |> select(sa2_name_2011, SIR, p50))
## Make a plot
aus_hexmap_plot <- ggplot() +
geom_sf(data=sa2thyroid_ERP, fill=NA, colour="grey60", size=0.1) +
geom_polygon(data = fort_hex, aes(x = long, y = lat, group = hex_id, fill = SIR)) +
scale_fill_identity() +
invthm
aus_hexmap_plot The big change from working with maps in a GIS and maps for data analysis is the SIZE of the map data.
We are going to demonstrate how you need to change your approach, using the code in map.R.
The shapefiles for this example are downloaded from https://imcarto.webflow.io/gdb. It is the “Province” data for Indonesian Administrative boundaries.
Items that are primary elements of a plot:
Organising items:
Conventions:
Calculations:
Backgrounds:
Don’t repeat yourself: no units on each tick mark (e.g. %)
Data pre-processing:
data:
we can do more with more funds?
data:
? Unknown status us decreasing, immune is increasing?
Data:
Screening has not helped reduce TB
data needs normalising. can we compare before screening and after? are similar proportion of population checked, or are these different populations?
interested in trend, but this is harder to read from a bar chart.
mismatch of bars and trend line, better to use points and trend line
? studies are similar numerically, some have substantially more variation
data
observations
# likert example
library(tibble)
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggstats)
library(colorspace)
# generate some data
d <- data.frame(current = factor(rep(c("very unlikely", "unlikely", "neutral", "likely", "very likely"),
c(1,2,3,4,5)),
levels = c("very unlikely", "unlikely", "neutral", "likely", "very likely")),
past = factor(rep(c("very unlikely", "unlikely", "neutral", "likely", "very likely"),
c(5,4,3,2,1)),
levels = c("very unlikely", "unlikely", "neutral", "likely", "very likely")))
# stacked bar
d_long <- d |>
count(current, past) |>
mutate(p = n/15) |>
pivot_longer(cols = c(current, past),
names_to = "var",
values_to = "val")
ggplot(d_long, aes(x=var, fill=val, y=p)) +
geom_col(position="stack") +
scale_fill_discrete_divergingx() +
xlab("") +
theme_minimal() bigpint package: matrix and par coordinatesHeatmaps show all the data, with colour representing a numerical value.
Four chips (variables), 77 genes. How many clusters of genes do you see?
Scatterplot matrix shows that there are no actual clusters here.
Assume:
Display:
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.